Analysis of active minutes, calories, and total steps
The American Heart Association and World Health Organization recommend at least 150 minutes of moderate-intensity activity or 75 minutes of vigorous activity, or a combination of both, each week. Spread over seven days, that is a daily goal of about 21.4 minutes of FairlyActiveMinutes or 10.7 minutes of VeryActiveMinutes.
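The daily thresholds follow from dividing the weekly recommendation by seven. A minimal sketch of that arithmetic (the variable names are illustrative, not from the original notebook):

```r
# Weekly guideline minutes quoted in the text above
weekly_moderate <- 150
weekly_vigorous <- 75

# Spread evenly across 7 days and round to one decimal place
daily_moderate <- round(weekly_moderate / 7, 1)  # 21.4
daily_vigorous <- round(weekly_vigorous / 7, 1)  # 10.7

print(c(moderate = daily_moderate, vigorous = daily_vigorous))
```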
Active users
active_users <- daily_activity %>%
filter(FairlyActiveMinutes >= 21.4 | VeryActiveMinutes>=10.7) %>%
group_by(Id) %>%
count(Id)
active_users
- 30 users met the fairly active or very active threshold on at least one day.
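Because the filter works on daily rows, a user counts as active after a single qualifying day. The base R sketch below uses a small made-up data frame (not the Fitbit data) to contrast that criterion with a stricter one based on each user's daily average. The stricter criterion is an assumption added for illustration and is not part of the original analysis.

```r
# Hypothetical toy data: 3 users, 2 days each
toy_activity <- data.frame(
  Id                  = c(1, 1, 2, 2, 3, 3),
  FairlyActiveMinutes = c(30, 0, 5, 10, 0, 0),
  VeryActiveMinutes   = c(0, 0, 2, 12, 0, 5)
)

# Criterion used above: the threshold is met on a given day
meets_goal <- toy_activity$FairlyActiveMinutes >= 21.4 |
  toy_activity$VeryActiveMinutes >= 10.7

# Number of qualifying days per user, then users with at least one such day
days_met       <- tapply(meets_goal, toy_activity$Id, sum)
users_met_once <- sum(days_met > 0)

# Stricter alternative (assumption): the user's daily average meets the threshold
avg_fairly    <- tapply(toy_activity$FairlyActiveMinutes, toy_activity$Id, mean)
avg_very      <- tapply(toy_activity$VeryActiveMinutes, toy_activity$Id, mean)
users_met_avg <- sum(avg_fairly >= 21.4 | avg_very >= 10.7)

print(c(at_least_one_day = users_met_once, on_average = users_met_avg))
```

In this toy data two users qualify under the one-day criterion and none under the average criterion, which shows how sensitive the count is to the definition.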
Creating variables for % of Different Activity Level Minutes
total_minutes <- sum(daily_activity$SedentaryMinutes, daily_activity$VeryActiveMinutes, daily_activity$FairlyActiveMinutes, daily_activity$LightlyActiveMinutes)
sedentary_percentage <- sum(daily_activity$SedentaryMinutes)/total_minutes*100
lightly_percentage <- sum(daily_activity$LightlyActiveMinutes)/total_minutes*100
fairly_percentage <- sum(daily_activity$FairlyActiveMinutes)/total_minutes*100
active_percentage <- sum(daily_activity$VeryActiveMinutes)/total_minutes*100
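The same shares can be computed in one step with `colSums()`. This is a sketch on made-up numbers, assuming only that the four minute columns carry the names used above:

```r
# Hypothetical toy data with the four minute columns
toy_minutes <- data.frame(
  SedentaryMinutes     = c(800, 900),
  LightlyActiveMinutes = c(150, 200),
  FairlyActiveMinutes  = c(10, 20),
  VeryActiveMinutes    = c(20, 0)
)

# Total minutes per activity level, then each level's share of all minutes
level_totals <- colSums(toy_minutes)
level_shares <- level_totals / sum(level_totals) * 100

print(round(level_shares, 2))
```

By construction the four shares add up to 100, which is a quick sanity check before plotting.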
Pie chart showing % of Different Activity Level Minutes
percentage <- data.frame(
level=c("Sedentary", "Lightly", "Fairly", "Very Active"),
minutes=c(sedentary_percentage,lightly_percentage,fairly_percentage,active_percentage))
plot_ly(percentage, labels = ~level, values = ~minutes, type = 'pie',textposition = 'outside',textinfo = 'label+percent') %>%
layout(title = 'Activity Level Minutes',
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
- Sedentary Minutes (81.3%) make up the largest share of Total Minutes, meaning users spend most of their time inactive.
- Fairly Active Minutes (1.11%) and Very Active Minutes (1.74%) make up a very small share of Total Minutes, meaning users are active for very little time.
How active are the users

- Sedentary Minutes have the most widely spread values in the dataset.
- Fairly Active Minutes and Very Active Minutes have quite a few outliers in the dataset.
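Spread and outliers of this kind can be checked numerically as well as visually. The sketch below uses made-up vectors rather than the Fitbit columns: `IQR()` measures spread and `boxplot.stats()` returns the points a box plot would draw as outliers.

```r
# Hypothetical daily values for two activity levels
sedentary_toy   <- c(600, 720, 800, 950, 1100, 1440)
very_active_toy <- c(0, 0, 5, 10, 15, 120)

# Interquartile range as a simple measure of spread
spread <- c(sedentary = IQR(sedentary_toy), very_active = IQR(very_active_toy))

# Points beyond 1.5 times the hinge spread, as flagged by a box plot
sedentary_outliers   <- boxplot.stats(sedentary_toy)$out
very_active_outliers <- boxplot.stats(very_active_toy)$out

print(spread)
print(very_active_outliers)
```

Here the sedentary values are more widely spread, while the very active values contain a single flagged outlier (120).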
Total steps vs Sedentary Minutes with Calories and Total Distance
par(mfrow = c(2, 2)) # note: only affects base graphics; the ggplot2 plots below still render one per figure
ggplot(data=daily_activity, aes(x=TotalSteps, y=SedentaryMinutes, color=Calories))+
geom_point()+
stat_smooth(method=lm)+
scale_color_gradient(low="blue", high="yellow")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: The following aesthetics were dropped during statistical transformation: colour
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?

ggplot(data=daily_activity, aes(x=TotalSteps, y=SedentaryMinutes, color=TotalDistance))+
geom_point()+
stat_smooth(method=lm)+
scale_color_gradient(low="blue", high="yellow")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: The following aesthetics were dropped during statistical transformation: colour
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
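The "aesthetics were dropped" warning appears because the continuous colour mapping is set globally, so `stat_smooth()` inherits it but cannot use it when fitting a single line. One way to avoid the warning is to map colour only in the point layer. A minimal sketch on made-up data, assuming ggplot2 is installed:

```r
library(ggplot2)

# Hypothetical toy data with the same column names as daily_activity
toy_daily <- data.frame(
  TotalSteps       = c(1000, 4000, 7000, 10000, 13000, 16000),
  SedentaryMinutes = c(1200, 1100, 950, 900, 800, 700),
  Calories         = c(1600, 1900, 2200, 2500, 2900, 3300)
)

# Colour is mapped inside geom_point() only, so the smoother never sees it
steps_plot <- ggplot(toy_daily, aes(x = TotalSteps, y = SedentaryMinutes)) +
  geom_point(aes(color = Calories)) +
  stat_smooth(method = "lm", formula = y ~ x) +
  scale_color_gradient(low = "blue", high = "yellow")

steps_plot
```

Passing `formula = y ~ x` explicitly also silences the informational message about the default formula.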

- The two plots are very similar.
- Users who are more active burn more calories, whereas users who are sedentary take fewer steps and burn fewer calories.
Interesting finding: some sedentary users take minimal steps but still burn 1500 to 2500 calories
ggplot(data=daily_activity, aes(x=TotalSteps, y = Calories, color=SedentaryMinutes))+
geom_point()+
labs(title="Total Steps vs Calories")+
xlab("Total Steps")+
stat_smooth(method=lm)+
scale_color_gradient(low="orange", high="steelblue")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: The following aesthetics were dropped during statistical transformation: colour
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?

- The more active a user is, the more steps they take and the more calories they burn. This is expected, and the data confirms it.
- Some sedentary users take minimal steps yet still burn 1500 to 2500 calories, similar to what more active users with higher step counts burn.
Users who take more steps burn more calories and have a lower BMI
ggplot(data=merged_data, aes(x=TotalSteps, y = BMI, color=Calories))+
geom_point()+
stat_smooth(method=lm)+
scale_color_gradient(low="blue", high="yellow")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 8881 rows containing non-finite values (`stat_smooth()`).
## Warning: The following aesthetics were dropped during statistical transformation: colour
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
## Warning: Removed 8881 rows containing missing values (`geom_point()`).

- Users who take more steps burn more calories and have a lower BMI.
- There are some outliers in the top left corner.
Regression analysis and R value, leverage points (lm.influence)
The summary of an lm() fit reports the R-squared value. 0% indicates that the model explains none of the variability of the response data around its mean; 100% indicates that it explains all of it. A positive slope means the two variables increase or decrease together, and a negative slope means one variable goes up as the other goes down.
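To make the R-squared definition concrete, the sketch below fits a line to simulated data and recomputes R-squared as one minus the ratio of the residual sum of squares to the total sum of squares. The data is made up; only `lm()`, `residuals()`, and `summary()` from base R are used.

```r
set.seed(1)

# Simulated data with a positive linear relationship plus noise
x <- 1:20
y <- 3 + 2 * x + rnorm(20)

fit <- lm(y ~ x)

# R-squared = 1 - (unexplained variation / total variation around the mean)
ss_residual <- sum(residuals(fit)^2)
ss_total    <- sum((y - mean(y))^2)
r2_manual   <- 1 - ss_residual / ss_total

# Should agree with the value reported by summary()
r2_reported <- summary(fit)$r.squared
print(all.equal(r2_manual, r2_reported))

# The sign of the slope gives the direction of the relationship
slope <- coef(fit)[["x"]]
```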
step_vs_sedentary.mod <- lm(SedentaryMinutes ~ TotalSteps, data = merged_data)
summary(step_vs_sedentary.mod)
##
## Call:
## lm(formula = SedentaryMinutes ~ TotalSteps, data = merged_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -811.33 -63.62 -37.76 41.37 742.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 811.4939381 2.3536052 344.79 <0.0000000000000002 ***
## TotalSteps -0.0094864 0.0002287 -41.48 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 202.5 on 43387 degrees of freedom
## Multiple R-squared: 0.03815, Adjusted R-squared: 0.03813
## F-statistic: 1721 on 1 and 43387 DF, p-value: < 0.00000000000000022
- Sedentary Minutes decrease by 0.0094864 minutes for every 1-step increase in Total Steps (about 9.49 minutes for every 1000 additional steps).
- Total Steps explains about 3.8% of the variation in Sedentary Minutes.
- The p-value is far below any conventional significance level (such as 0.05), so the relationship is statistically significant.
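The "per 1000 steps" figure is the per-step coefficient multiplied by 1000. The sketch below shows that arithmetic with the estimate reported above, then confirms on simulated data (not the Fitbit data) that refitting with steps expressed in thousands gives the same rescaled slope and an unchanged R-squared.

```r
# Estimate for TotalSteps taken from the summary output above
slope_per_step       <- -0.0094864
slope_per_1000_steps <- slope_per_step * 1000  # minutes per 1000 steps

# Simulated toy data to illustrate that rescaling the predictor rescales the slope
set.seed(123)
toy <- data.frame(TotalSteps = seq(0, 20000, by = 1000))
toy$SedentaryMinutes <- 800 - 0.01 * toy$TotalSteps + rnorm(nrow(toy), sd = 50)

fit_per_step <- lm(SedentaryMinutes ~ TotalSteps, data = toy)
fit_per_1000 <- lm(SedentaryMinutes ~ I(TotalSteps / 1000), data = toy)

rescaled_slope <- coef(fit_per_step)[["TotalSteps"]] * 1000
refit_slope    <- coef(fit_per_1000)[[2]]
print(all.equal(rescaled_slope, refit_slope))
```

Changing the unit of the predictor changes how the slope reads, not how well the model fits.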
bmi_vs_steps.mod <- lm(BMI ~ TotalSteps, data = merged_data)
summary(bmi_vs_steps.mod)
##
## Call:
## lm(formula = BMI ~ TotalSteps, data = merged_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.6517 -0.7069 -0.3289 -0.0292 22.5574
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.339021309 0.026110686 970.45 <0.0000000000000002 ***
## TotalSteps -0.000094039 0.000002463 -38.19 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.862 on 34506 degrees of freedom
## (8881 observations deleted due to missingness)
## Multiple R-squared: 0.04055, Adjusted R-squared: 0.04052
## F-statistic: 1458 on 1 and 34506 DF, p-value: < 0.00000000000000022
- BMI decreases by 0.000094039 for every 1-step increase in Total Steps (about 0.94 BMI points for every 10,000 additional steps).
- Total Steps explains about 4.1% of the variation in BMI.
- The p-value is far below any conventional significance level, so the relationship is statistically significant.
calories_vs_steps.mod <- lm(Calories ~ TotalSteps, data = merged_data)
summary(calories_vs_steps.mod)
##
## Call:
## lm(formula = Calories ~ TotalSteps, data = merged_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1478.95 -176.96 -116.26 14.13 2258.40
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1478.9532481 5.2933996 279.4 <0.0000000000000002 ***
## TotalSteps 0.0666051 0.0005143 129.5 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 455.5 on 43387 degrees of freedom
## Multiple R-squared: 0.2788, Adjusted R-squared: 0.2788
## F-statistic: 1.677e+04 on 1 and 43387 DF, p-value: < 0.00000000000000022
- Calories increase by about 0.067 for every 1-step increase in Total Steps (about 67 calories for every 1000 additional steps).
- Total Steps explains about 27.9% of the variation in Calories.
- The p-value is far below any conventional significance level, so the relationship is statistically significant.
veryactive_vs_sleep.mod <- lm(VeryActiveMinutes ~ TotalMinutesAsleep, data = merged_data)
summary(veryactive_vs_sleep.mod)
##
## Call:
## lm(formula = VeryActiveMinutes ~ TotalMinutesAsleep, data = merged_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.500 -22.737 -7.984 14.862 187.401
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.595768 0.582829 40.485 <0.0000000000000002 ***
## TotalMinutesAsleep -0.001652 0.001313 -1.258 0.208
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25.26 on 42416 degrees of freedom
## (971 observations deleted due to missingness)
## Multiple R-squared: 3.732e-05, Adjusted R-squared: 1.374e-05
## F-statistic: 1.583 on 1 and 42416 DF, p-value: 0.2084
- Very Active Minutes decrease by about 0.0017 for every 1-minute increase in Total Minutes Asleep (about 1.7 Very Active Minutes for every 1000 additional minutes asleep).
- The p-value (0.208) is greater than the significance level, so the relationship is not statistically significant and the slope cannot be distinguished from zero.
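Leverage points can be inspected with `lm.influence()`, whose `hat` component gives each observation's leverage. The sketch below uses made-up data and flags observations whose hat value exceeds 2p/n, a common rule of thumb. Both the data and the threshold are assumptions for illustration, not results from the Fitbit models above.

```r
# Hypothetical toy data: the last observation has an unusually high step count
toy <- data.frame(
  TotalSteps = c(1000, 3000, 5000, 7000, 9000, 36000),
  Calories   = c(1500, 1700, 1900, 2100, 2300, 4100)
)

fit <- lm(Calories ~ TotalSteps, data = toy)

# Hat values measure how far each observation's predictor value is from the rest
hat_values <- lm.influence(fit)$hat

# Rule of thumb (assumption): flag leverage above 2 * (number of coefficients) / n
leverage_cutoff <- 2 * length(coef(fit)) / nrow(toy)
high_leverage   <- which(hat_values > leverage_cutoff)

print(round(hat_values, 3))
print(high_leverage)
```

In this toy example only the sixth observation is flagged, because its step count sits far from the others.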
The high volume of moderate-to-vigorous physical activity is achieved by a very small proportion of the population
active_minutes_vs_calories <- ggplot(data = merged_data) +
geom_point(mapping=aes(x=Calories, y=FairlyActiveMinutes), color = "maroon", alpha = 1/3) +
geom_smooth(method = loess,formula =y ~ x, mapping=aes(x=Calories, y=FairlyActiveMinutes, color=FairlyActiveMinutes), color = "maroon", se = FALSE) +
geom_point(mapping=aes(x=Calories, y=VeryActiveMinutes), color = "forestgreen", alpha = 1/3) +
geom_smooth(method = loess,formula =y ~ x,mapping=aes(x=Calories, y=VeryActiveMinutes, color=VeryActiveMinutes), color = "forestgreen", se = FALSE) +
geom_point(mapping=aes(x=Calories, y=LightlyActiveMinutes), color = "orange", alpha = 1/3) +
geom_smooth(method = loess,formula =y ~ x,mapping=aes(x=Calories, y=LightlyActiveMinutes, color=LightlyActiveMinutes), color = "orange", se = FALSE) +
geom_point(mapping=aes(x=Calories, y=SedentaryMinutes), color = "steelblue", alpha = 1/3) +
geom_smooth(method = loess,formula =y ~ x,mapping=aes(x=Calories, y=SedentaryMinutes, color=SedentaryMinutes), color = "steelblue", se = FALSE) +
annotate("text", x=4800, y=160, label="Very Active", color="black", size=3)+
annotate("text", x=4800, y=0, label="Fairly Active", color="black", size=3)+
annotate("text", x=4800, y=500, label="Sedentary", color="black", size=3)+
annotate("text", x=4800, y=250, label="Lightly Active", color="black", size=3)+
labs(x = "Calories", y = "Active Minutes", title="Calories vs Active Minutes")
active_minutes_vs_calories

- According to this healthline.com article, a moderately active woman between the ages of 26 and 50 needs to eat about 2,000 calories per day, and a moderately active man between the ages of 26 and 45 needs about 2,600 calories per day, to maintain weight.
- Comparing the four activity levels against calories, most of the data is concentrated among users who burn 2000 to 3000 calories a day.
- These users spend on average 8 to 13 hours sedentary, about 5 hours lightly active, and 1 to 2 hours fairly or very active.
- Additionally, the sedentary line levels off toward the end, while the fairly active and very active lines curl back up.
- This suggests that users who burn more calories spend less time sedentary and more time fairly or very active.
active_minutes_vs_steps <- ggplot(data = merged_data) +
geom_point(mapping=aes(x=TotalSteps, y=FairlyActiveMinutes), color = "maroon", alpha = 1/3) +
geom_smooth(method = loess,formula =y ~ x, mapping=aes(x=TotalSteps, y=FairlyActiveMinutes, color=FairlyActiveMinutes), color = "maroon", se = FALSE) +
geom_point(mapping=aes(x=TotalSteps, y=VeryActiveMinutes), color = "forestgreen", alpha = 1/3) +
geom_smooth(method = loess,formula =y ~ x,mapping=aes(x=TotalSteps, y=VeryActiveMinutes, color=VeryActiveMinutes), color = "forestgreen", se = FALSE) +
geom_point(mapping=aes(x=TotalSteps, y=LightlyActiveMinutes), color = "orange", alpha = 1/3) +
geom_smooth(method = loess,formula =y ~ x,mapping=aes(x=TotalSteps, y=LightlyActiveMinutes, color=LightlyActiveMinutes), color = "orange", se = FALSE) +
geom_point(mapping=aes(x=TotalSteps, y=SedentaryMinutes), color = "steelblue", alpha = 1/3) +
geom_smooth(method = loess,formula =y ~ x,mapping=aes(x=TotalSteps, y=SedentaryMinutes, color=SedentaryMinutes), color = "steelblue", se = FALSE) +
annotate("text", x=35000, y=150, label="Very Active", color="black", size=3)+
annotate("text", x=35000, y=50, label="Fairly Active", color="black", size=3)+
annotate("text", x=35000, y=1350, label="Sedentary", color="black", size=3)+
annotate("text", x=35000, y=380, label="Lightly Active", color="black", size=3)+
labs(x = "Total Steps", y = "Active Minutes", title="Steps vs Active Minutes")
active_minutes_vs_steps

- Comparing the four activity levels against total steps, most of the data is concentrated among users who take about 5000 to 15000 steps a day.
- These users spend on average 8 to 13 hours sedentary, about 5 hours lightly active, and 1 to 2 hours fairly or very active.
Analysis on sleep
Converting sleep time from minutes to hours
sleep_day_in_hour <-sleep_day
sleep_day_in_hour$TotalMinutesAsleep <- sleep_day_in_hour$TotalMinutesAsleep/60
sleep_day_in_hour$TotalTimeInBed <- sleep_day_in_hour$TotalTimeInBed/60
head(sleep_day_in_hour)
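After the conversion, the columns still carry "Minutes" in their names although they now hold hours. An alternative is to store the converted values in new, unit-named columns. This is a sketch on made-up data; the names `HoursAsleep` and `HoursInBed` are hypothetical and not used elsewhere in this analysis.

```r
# Hypothetical toy data with the same columns as sleep_day
toy_sleep <- data.frame(
  Id                 = c(1, 2),
  TotalMinutesAsleep = c(420, 615),
  TotalTimeInBed     = c(450, 660)
)

# Keep the original minute columns and add hour columns next to them
toy_sleep$HoursAsleep <- toy_sleep$TotalMinutesAsleep / 60
toy_sleep$HoursInBed  <- toy_sleep$TotalTimeInBed / 60

# Count records with more than 10 hours asleep or in bed
long_sleep_records <- sum(toy_sleep$HoursAsleep > 10)
long_bed_records   <- sum(toy_sleep$HoursInBed > 10)

print(toy_sleep)
```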
Checking for any sleep outliers
Number of records where a user slept or spent time in bed for more than 10 hours
sum(sleep_day_in_hour$TotalMinutesAsleep > 10)
## [1] 18
sum(sleep_day_in_hour$TotalTimeInBed > 10)
## [1] 30
Number of records where a user slept or spent time in bed for less than 1 hour
sum(sleep_day_in_hour$TotalMinutesAsleep < 1)
## [1] 2
sum(sleep_day_in_hour$TotalTimeInBed < 1)
## [1] 0
According to this article, people spend 55 minutes awake in bed before going to sleep.
Let's see how many users in the Fitbit data match this finding
awake_in_bed <- mutate(sleep_day, AwakeTime = TotalTimeInBed - TotalMinutesAsleep)
awake_in_bed <- awake_in_bed %>%
filter(AwakeTime >= 55) %>%
group_by(Id) %>%
arrange(desc(AwakeTime))
n_distinct(awake_in_bed$Id)
## [1] 13
- 13 users spent 55 minutes or more awake in bed on at least one night.
How many minutes a user sleeps may not correlate well with how active they are, but sedentary time accounts for about 80% of the day
Using regression analysis to check whether users who spend more time sedentary also spend more time sleeping
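The awake-in-bed calculation can be traced on a few made-up rows. This base R sketch mirrors the steps above (difference of the two columns, filter at 55 minutes, sort descending, count distinct users); the data is hypothetical.

```r
# Hypothetical toy data with the same columns as sleep_day
toy_sleep <- data.frame(
  Id                 = c(1, 1, 2, 3),
  TotalMinutesAsleep = c(400, 420, 300, 450),
  TotalTimeInBed     = c(430, 500, 310, 520)
)

# Minutes in bed that were not spent asleep
toy_sleep$AwakeTime <- toy_sleep$TotalTimeInBed - toy_sleep$TotalMinutesAsleep

# Keep records with 55 or more awake minutes, longest first
long_awake <- toy_sleep[toy_sleep$AwakeTime >= 55, ]
long_awake <- long_awake[order(-long_awake$AwakeTime), ]

# Number of distinct users with at least one such record
users_long_awake <- length(unique(long_awake$Id))

print(long_awake)
print(users_long_awake)
```

Note that `AwakeTime` here is all time in bed not spent asleep, so it may include time awake during the night or after waking, not only the time before falling asleep.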
sedentary_vs_sleep.mod <- lm(SedentaryMinutes ~ TotalMinutesAsleep, data = merged_data)
summary(sedentary_vs_sleep.mod)
##
## Call:
## lm(formula = SedentaryMinutes ~ TotalMinutesAsleep, data = merged_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -878.84 -76.54 -17.80 42.03 866.28
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 904.88714 4.48547 201.74 <0.0000000000000002 ***
## TotalMinutesAsleep -0.44156 0.01011 -43.69 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 194.4 on 42416 degrees of freedom
## (971 observations deleted due to missingness)
## Multiple R-squared: 0.04306, Adjusted R-squared: 0.04304
## F-statistic: 1909 on 1 and 42416 DF, p-value: < 0.00000000000000022
- Sedentary Minutes decrease by 0.442 for every 1-minute increase in Total Minutes Asleep (about 44.2 Sedentary Minutes for every 100 additional minutes asleep).
- Total Minutes Asleep explains about 4.3% of the variation in Sedentary Minutes.
- The p-value is far below any conventional significance level, so the relationship is statistically significant.
Finding the relationship between Total Minutes Asleep and Calories, colored by Total Steps, to answer “Do people who sleep more burn fewer calories?”
ggplot(data=merged_data, aes(x=TotalMinutesAsleep/60, y=Calories, color=TotalSteps))+
geom_point()+
labs(title="Total Hours Asleep vs Calories")+
xlab("Total Hours Asleep")+
stat_smooth(method=lm)+
scale_color_gradient(low="orange", high="steelblue")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 971 rows containing non-finite values (`stat_smooth()`).
## Warning: The following aesthetics were dropped during statistical transformation: colour
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
## Warning: Removed 971 rows containing missing values (`geom_point()`).

- The majority of users sleep between 5 and 10 hours and burn around 1500 to 4500 calories a day.
- There is little apparent correlation between time asleep and calories burned.
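A visual impression of weak correlation can be backed by a number using `cor.test()`. The sketch below runs it on simulated data in which calories are generated independently of sleep, so it only demonstrates the call and the fields to read. It says nothing about the actual Fitbit values.

```r
set.seed(42)

# Simulated toy data: calories are drawn independently of hours asleep
hours_asleep <- runif(50, min = 5, max = 10)
calories     <- rnorm(50, mean = 2500, sd = 500)

# Pearson correlation with a test of the null hypothesis of zero correlation
sleep_cor <- cor.test(hours_asleep, calories)

correlation <- unname(sleep_cor$estimate)
p_value     <- sleep_cor$p.value

print(c(correlation = correlation, p_value = p_value))
```

With the real data, the same call could be applied to `merged_data$TotalMinutesAsleep` and `merged_data$Calories`, with missing values handled first.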